Colombia COVID-19 - Central region
Our project
We decided to focus on central Colombia, mainly because it is the region that contains the capital.
We built a model for the number of confirmed cases using all the other covariates (plus some we created), and we estimated the predictive accuracy of our selected model.
We decided to consider as central Colombia the following departments/districts: Bogotá D.C., Boyacá, Tolima, Cundinamarca, Meta, Quindío, Cauca, Valle del Cauca, Risaralda, Caldas, Antioquia, Santander, Casanare.
Loading the dataset
This dataset is entirely missing the department Valle del Cauca!
colombia_covid <- as.data.frame(read_csv("data/covid19co.csv"))
cols <- colnames(colombia_covid)[c(1, 4, 5, 6, 7, 8, 9, 11, 14)]
colombia_covid <- colombia_covid[cols]
colombia_covid <- colombia_covid[, c(1, 9, 2, 3, 4, 5, 6, 7, 8)]
colnames(colombia_covid) <- c("ID de caso", "Fecha de diagnóstico", "Ciudad de ubicación", "Departamento o Distrito", "Atención", "Edad" , "Sexo", "Tipo", "País de procedencia")
#colombia_covid$`Departamento o Distrito`[which(colombia_covid$`Departamento o Distrito` == "Valle Del Cauca")] <- "Valle del Cauca"
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca",
"Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]
colombia_covid <- colombia_covid[-which(colombia_covid$`Fecha de diagnóstico`== "-"), ]
Description of variables
ID de caso: ID of the confirmed case.
Fecha de diagnóstico: Date in which the disease was diagnosed.
Ciudad de ubicación: City where the case was diagnosed.
Departamento o Distrito: Department or district where the city belongs to.
Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.
Edad: Age of the confirmed case.
Sexo: Sex of the confirmed case.
Tipo: How the person got infected: in Colombia, abroad or unknown.
País de procedencia: Country of origin if the person got infected abroad.
Map
Here we can see our selected cities. The color of the pins is related to the number of cases: if there are fewer than \(10\) the color is “green”, if there are fewer than \(100\) it is “orange”, otherwise it is “red”.
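The threshold rule for the pin colours can be written as a small helper function (a sketch; the function name is ours and the actual map code is not shown in this report):

```r
# Colour of a map pin as a function of the number of cases in that city:
# fewer than 10 -> green, fewer than 100 -> orange, otherwise red.
pin_color <- function(n_cases) {
  if (n_cases < 10) "green"
  else if (n_cases < 100) "orange"
  else "red"
}

pin_color(5)    # "green"
pin_color(500)  # "red"
```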
Preprocessing
We had to clean the dataset:
- we transformed the Fecha de diagnóstico variable into a Date-type variable,
- we fixed the variable ID de caso (since we removed some departments, and hence some rows, the numbers were no longer consecutive),
- we created a variable Grupo de edad,
- we cleaned the column País de procedencia (replacing cities with the corresponding country) and created the variable Continente de procedencia (as the former is too fragmented, we decided to consider continents).
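In code, these cleaning steps look roughly like this (a base-R sketch on a toy data frame; the exact age-group break points are our assumption, chosen to match the labels that appear in the output below):

```r
df <- data.frame(
  `ID de caso`           = c(3, 7, 12),
  `Fecha de diagnóstico` = c("2020-03-06", "2020-03-09", "2020-03-11"),
  Edad                   = c(19, 50, 55),
  check.names = FALSE
)

# 1. Parse the diagnosis date as a Date-type variable
df$`Fecha de diagnóstico` <- as.Date(df$`Fecha de diagnóstico`, format = "%Y-%m-%d")

# 2. Re-index the case IDs so they are consecutive after the row filtering
df$`ID de caso` <- seq_len(nrow(df))

# 3. Build the age groups (break points assumed from the labels used in the report)
df$`Grupo de edad` <- cut(df$Edad,
                          breaks = c(0, 18, 30, 45, 60, 75, Inf),
                          labels = c("0_18", "19_30", "31_45", "46_60", "60_75", "76+"))
```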
## ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1 1 2020-03-06 Bogotá D.C. Bogotá D.C.
## 2 2 2020-03-09 Medellín Antioquia
## 3 3 2020-03-11 Medellín Antioquia
## Atención Edad Sexo Tipo País de procedencia Grupo de edad
## 1 Recuperado 19 F Importado Italia 19_30
## 2 Recuperado 50 F Importado España 46_60
## 3 Recuperado 55 M Relacionado Nan 46_60
New dataset I
## Date Elapsed time New cases/day Cumulative cases
## 1 2020-03-06 0 1 1
## 2 2020-03-09 3 1 2
## 3 2020-03-11 5 5 7
## 4 2020-03-12 6 2 9
## 5 2020-03-13 7 3 12
## 6 2020-03-14 8 14 26
## 7 2020-03-15 9 13 39
## 8 2020-03-16 10 8 47
## 9 2020-03-17 11 13 60
## 10 2020-03-18 12 9 69
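The `Cumulative cases` column is simply the running sum of `New cases/day`; in R:

```r
# First ten daily counts from the table above
new_cases  <- c(1, 1, 5, 2, 3, 14, 13, 8, 13, 9)
cumulative <- cumsum(new_cases)
cumulative  # 1 2 7 9 12 26 39 47 60 69
```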
New dataset II
## Date Elapsed time Department Department ID New cases/day
## 1 2020-03-09 3 Antioquia 1 1
## 2 2020-03-11 5 Antioquia 1 3
## 3 2020-03-14 8 Antioquia 1 3
## 4 2020-03-15 9 Antioquia 1 1
## 5 2020-03-19 13 Antioquia 1 3
## 6 2020-03-20 14 Antioquia 1 11
## 7 2020-03-21 15 Antioquia 1 3
## 8 2020-03-22 16 Antioquia 1 5
## 9 2020-03-23 17 Antioquia 1 22
## 10 2020-03-25 19 Antioquia 1 8
## Cumulative cases/Department Mean age
## 1 1 50.00000
## 2 4 35.66667
## 3 7 30.00000
## 4 8 55.00000
## 5 11 52.33333
## 6 22 39.81818
## 7 25 31.00000
## 8 30 45.40000
## 9 52 36.45455
## 10 60 29.75000
Exploring the dataset
Scattered information about the pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):
the quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that the effects of the quarantine are not yet visible in our data.
on the 26th of March the machine that prepared the samples for processing and subsequent diagnosis of COVID-19 was damaged, which affected the speed at which results were produced. This could explain the very low number of confirmed cases around that date.
The previous plot represents the daily incidence of the disease across all the departments we are taking into account.
Other plots
Most of the cases are in the capital, Bogotá.
Here the growth seems exponential (consistent with the fact that we are studying the early stages of the outbreak).
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))
ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +
geom_bar(data = subset(colombia_covid, Sexo == "F")) +
geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) +
scale_y_continuous(breaks = brks,
labels = lbls) +
coord_flip() +
labs(title="Spread of the desease across genders",
y = "Number of cases",
x = "Department",
fill = "Gender") +
theme_tufte() +
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank()) +
scale_fill_brewer(palette = "Dark2")
The disease (number of cases) is more or less equally distributed across genders.
#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- colombia_covid %>%
group_by(`Grupo de edad`) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(`Grupo de edad`))
age_groups_pie$label <- scales::percent(age_groups_pie$per)
age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(`Grupo de edad`))) +
geom_bar(stat="identity", width = 1) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill="Age groups",
x=NULL,
y=NULL,
title="Distribution of the desease across ages") +
coord_polar(theta = "y") +
#geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
guides(fill = guide_legend(title = "Group"))
age_pie
People from 31 to 45 years old are the most affected by the disease, and people over 76 are the least affected. Colombia is a very young country: in 2018 the median age of the population was 30.4 years, and the largest age group, people from 25 to 54 years old, comprises 41.98% of the population (https://www.indexmundi.com/colombia/demographics_profile.html).
Age-Sex plot
There isn’t much difference between the sexes across the age groups.
Tipo plot
theme_set(theme_classic())
ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_brewer(palette = "Set3") +
geom_bar(aes(fill = Tipo), width = 0.8) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across type",
x = "Date of confirmation",
fill = "Type")
Tipo
We think that “En estudio” means that it is not yet clear whether the case is imported or not. From the plot it may seem that there are many imported cases, so let us count them:
type_pie <- colombia_covid %>%
group_by(Tipo) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(Tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("Tipo", "Total number", "Percentage")
type_pie
## # A tibble: 3 x 3
## Tipo `Total number` Percentage
## <chr> <int> <chr>
## 1 Relacionado 8570 12%
## 2 Importado 687 1%
## 3 En Estudio 60406 87%
Most cases in the country are still under investigation (“En Estudio”); among the classified ones, the majority are people that got infected inside Colombia. People that contracted the disease abroad came mainly from Europe, followed by North America and Central America.
The frequentist approach
Train/test split
We split the data so as to leave out the last few points for prediction, both because we have few observations and because with these models it makes little sense to predict a whole week ahead: the situation changes very fast.
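A minimal sketch of such a head/tail split (the helper name and toy data are ours; the models below use indices `1:120` for fitting and the last rows for prediction):

```r
# Keep the first n - n_test rows for fitting, hold out the last n_test for prediction
split_train_test <- function(df, n_test) {
  n <- nrow(df)
  list(train = df[1:(n - n_test), , drop = FALSE],
       test  = df[(n - n_test + 1):n, , drop = FALSE])
}

toy <- data.frame(day = 1:126, cases = rpois(126, 10))
s <- split_train_test(toy, n_test = 6)
nrow(s$train)  # 120
nrow(s$test)   # 6
```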
Poisson with Elapsed time as predictor
poisson1 <- glm(`Cumulative cases` ~ `Elapsed time`, data=data1[1:120, ], family=poisson)
pred.pois <- poisson1$fitted.values
res.st <- (data1$`Cumulative cases`[1:120] - pred.pois)/sqrt(pred.pois)
#n=120
#k=2
#n-k=118
print(paste("Estimated overdispersion", sum(res.st^2)/118))
## [1] "Estimated overdispersion 126.060786858437"
poisson1.pred <- predict(poisson1, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 1760.50922940986"
#paste("MSE:", mean(poisson1$residuals^2))
#print(sprintf("MSE: %0.2f", sum(poisson1$residuals^2)/poisson1$df.residual))
#print(sprintf("MSE: %0.2f", anova(poisson1)['Residuals', 'Mean Sq']))
paste("AIC:", poisson1$aic)
## [1] "AIC: 21915.0291925917"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1$null.deviance, deviance(poisson1)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 20716.39"
Using cases_relev_dep
poisson1A <- glm(`Cumulative cases/Department` ~ `Elapsed time`, data=cases_relev_dep, family=poisson)
summary(poisson1A)
##
## Call:
## glm(formula = `Cumulative cases/Department` ~ `Elapsed time`,
## family = poisson, data = cases_relev_dep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -101.60 -47.19 -18.96 -8.85 355.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.185e+00 3.698e-03 1132 <2e-16 ***
## `Elapsed time` 3.492e-02 3.455e-05 1011 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5910918 on 1050 degrees of freedom
## Residual deviance: 4404273 on 1049 degrees of freedom
## AIC: 4411978
##
## Number of Fisher Scoring iterations: 6
## [1] "MSE: 3.77645088133327"
## [1] "AIC: 4411977.61248341"
We can see that the AIC is enormous.
Using New cases/day
poisson1B <- glm(`New cases/day` ~ `Elapsed time`, data=cases, family=poisson)
#print(paste("Estimated overdispersion", sum(res.st^2)/23))
par(mfrow=c(2,2))
plot(poisson1B)
## [1] "MSE: 0.219324768650498"
## [1] "AIC: 8427.11326445979"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1B$null.deviance, deviance(poisson1B)), 2))
## [1] "Null deviance: 95891.85" "Residual deviance: 7519.3"
Poisson with Elapsed time plus Elapsed time^2 as predictor
poisson1 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2), data=cases[1:120, ], family=poisson)
pred.pois <- poisson1$fitted.values
res.st <- (cases$`Cumulative cases`[1:120] - pred.pois)/sqrt(pred.pois)
#n=120, k=2, n-k=118
print(paste("Estimated overdispersion", sum(res.st^2)/118))
## [1] "Estimated overdispersion 74.6267856024938"
poisson1.pred <- predict(poisson1, newdata = cases[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.pred - cases$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 6940.37335779742"
#paste("MSE:", mean(poisson1$residuals^2))
#print(sprintf("MSE: %0.2f", sum(poisson1$residuals^2)/poisson1$df.residual))
#print(sprintf("MSE: %0.2f", anova(poisson1)['Residuals', 'Mean Sq']))
paste("AIC:", poisson1$aic)
## [1] "AIC: 12062.7254007421"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1$null.deviance, deviance(poisson1)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 10862.08"
Poisson with Elapsed time plus Sexo
poisson2 <- glm(`Cumulative cases` ~ `Elapsed time` + Sexo_M, data=data1[1:120, ], family=poisson)
poisson2.pred <- predict(poisson2, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson2.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 3892.80826581986"
## [1] "AIC: 20887.674847589"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson2$null.deviance, deviance(poisson2)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 19687.03"
Poisson with Elapsed time plus Grupo de edad
poisson3 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1[1:120, ], family=poisson)
pred.pois3 <- poisson3$fitted.values
res.st3 <- (data1$`Cumulative cases` - pred.pois3)/sqrt(pred.pois3)
#n=120, k=7, n-k=113
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st3^2)/113))
## [1] "Estimated overdispersion 439618.33736315"
poisson3.pred <- predict(poisson3, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson3.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 3249.31716470375"
## [1] "AIC: 20650.0253613547"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson3$null.deviance, deviance(poisson3)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 19441.38"
Poisson with Elapsed time, Age and Departments as predictors
poisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`, data=data1[1:120, ], family=poisson)
# MISSING `Departamento o Distrito_Valle del Cauca` !!!!!!!!!!!!!!!!!!!!!!!!
par(mfrow=c(2,2))
plot(poisson4)
pred.pois4 <- poisson4$fitted.values
res.st4 <- (data1$`Cumulative cases` - pred.pois4)/sqrt(pred.pois4)
#n=25, k=19, n-k=6
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4^2)/6))
## [1] "Estimated overdispersion 9305019.32906713"
poisson4.pred <- predict(poisson4, newdata = data1[120:126, ], type="response")
paste("Real: ", data1$`Cumulative cases`[120:126], "Predict: ", poisson4.pred)
## [1] "Real: 52763 Predict: 51352.1683340905"
## [2] "Real: 55062 Predict: 51926.3623882142"
## [3] "Real: 57218 Predict: 51407.7654815247"
## [4] "Real: 59929 Predict: 46143.4658501589"
## [5] "Real: 63731 Predict: 51890.9303272534"
## [6] "Real: 67003 Predict: 57193.1704598653"
## [7] "Real: 69663 Predict: 71578.2029401362"
## [1] "RMSE: 8243.70282563776"
## [1] "AIC: 18085.2709276108"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson4$null.deviance, deviance(poisson4)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 16854.63"
Poisson with Elapsed time, Age and Departments as predictors for New cases/day
poisson4bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`, data=data1[1:120, ], family=poisson)
# MISSING `Departamento o Distrito_Valle del Cauca` !!!!!!!!!!!!!!!!!!!!!!!!
par(mfrow=c(2,2))
plot(poisson4bis)
pred.pois4bis <- poisson4bis$fitted.values
res.st4bis <- (data1$`Cumulative cases` - pred.pois4bis)/sqrt(pred.pois4bis)
#n=120, k=19, n-k=101
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4bis^2)/101))
## [1] "Estimated overdispersion 6785702.57486846"
poisson4bis.pred <- predict(poisson4bis, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson4bis.pred - data1$`New cases/day`[120:126])^2)))
## [1] "RMSE: 1225.26041309972"
## [1] "AIC: 2860.38819023372"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson4bis$null.deviance, deviance(poisson4bis)), 2))
## [1] "Null deviance: 64369.08" "Residual deviance: 1979.14"
ANOVA to compare the Poisson models
## Analysis of Deviance Table
##
## Model 1: `Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2)
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 117 10862
## 2 113 19441 4 -8579.3
## 3 102 16855 11 2586.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
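The same kind of nested-model deviance comparison can be reproduced with `anova()` on toy data (hypothetical variables, not our dataset):

```r
set.seed(1)
toy   <- data.frame(t = 1:100)
toy$y <- rpois(100, exp(1 + 0.03 * toy$t))

m1 <- glm(y ~ t,          data = toy, family = poisson)
m2 <- glm(y ~ t + I(t^2), data = toy, family = poisson)

# Chi-squared test on the deviance drop between the two nested Poisson models
anova(m1, m2, test = "Chisq")
```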
Predictive accuracy of the Poisson model for Cumulative cases
Predicting with a \(95\%\) confidence interval
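The code below calls predict.confidence, a helper defined elsewhere in the project. A plausible sketch of what such a helper might do (our assumption: a Wald interval on the link scale, mapped back through the exponential) is:

```r
# Hypothetical reconstruction of predict.confidence for a Poisson GLM:
# 95% Wald interval on the linear predictor, exponentiated to the response scale.
predict_confidence <- function(model, newdata, level = 0.95) {
  p <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(fit = exp(p$fit),
             lwr = exp(p$fit - z * p$se.fit),
             upr = exp(p$fit + z * p$se.fit))
}

# Toy usage on simulated data
set.seed(42)
d   <- data.frame(t = 1:30)
d$y <- rpois(30, exp(1 + 0.1 * d$t))
m   <- glm(y ~ t, data = d, family = poisson)
ci  <- predict_confidence(m, d)
```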
attach(data1)
conf.df_pois <- predict.confidence(poisson4, newdata = data1[1:120, ])
#conf.df_pois <- predict.glm(poisson4, newdata = data1, interval = "confidence", type="response")
print(conf.df_pois)
## fit lwr upr
## 1 330.0746 326.5389 333.6486
## 2 375.9108 372.0358 379.8261
## 3 408.9914 404.8613 413.1637
## 4 429.0613 424.8173 433.3478
## 5 449.1625 444.8197 453.5477
## 6 465.5929 461.0866 470.1433
## 7 488.4038 483.7547 493.0975
## 8 510.6072 505.8188 515.4409
## 9 533.5229 528.5940 538.4977
## 10 552.9769 547.9174 558.0831
## 11 581.7205 576.5240 586.9639
## 12 602.9698 597.5202 608.4691
## 13 631.7498 626.2286 637.3196
## 14 664.0337 658.2573 669.8607
## 15 680.7526 674.7469 686.8116
## 16 721.6001 715.5526 727.6987
## 17 757.8006 751.1651 764.4947
## 18 785.9674 779.5862 792.4009
## 19 823.5685 816.8557 830.3365
## 20 858.9395 852.1669 865.7658
## 21 876.7155 869.4557 884.0359
## 22 930.6729 923.4462 937.9562
## 23 971.7989 964.3235 979.3322
## 24 1032.8012 1022.9102 1042.7877
## 25 1054.1709 1046.2887 1062.1126
## 26 1093.3457 1085.0495 1101.7053
## 27 1146.1746 1137.1944 1155.2258
## 28 1202.9545 1194.1638 1211.8099
## 29 1258.7105 1249.6868 1267.7994
## 30 1311.3175 1302.1746 1320.5246
## 31 1357.8835 1337.6157 1378.4585
## 32 1444.9156 1434.7098 1455.1941
## 33 1519.8684 1509.1450 1530.6679
## 34 1587.8759 1575.2161 1600.6374
## 35 1630.2011 1619.8580 1640.6103
## 36 1722.7991 1712.2271 1733.4363
## 37 1824.4607 1812.4555 1836.5453
## 38 1850.1033 1837.2203 1863.0766
## 39 1989.4668 1978.1516 2000.8467
## 40 1922.1853 1901.6820 1942.9096
## 41 2115.2571 2102.3406 2128.2530
## 42 2234.8920 2221.9995 2247.8592
## 43 2331.9074 2319.7325 2344.1463
## 44 2431.4378 2418.1501 2444.7984
## 45 2634.0492 2617.5718 2650.6302
## 46 2744.4903 2729.0463 2760.0217
## 47 2873.0845 2858.0022 2888.2463
## 48 2863.4143 2838.8978 2888.1425
## 49 3176.8999 3158.6408 3195.2647
## 50 3196.9225 3179.5268 3214.4134
## 51 3893.4344 3848.6899 3938.6991
## 52 3462.5208 3440.7520 3484.4273
## 53 3635.1333 3611.3954 3659.0273
## 54 4217.3130 4170.8403 4264.3034
## 55 3936.4731 3910.4026 3962.7174
## 56 4618.8287 4576.6449 4661.4013
## 57 4231.7340 4208.6675 4254.9269
## 58 5587.4719 5496.0193 5680.4463
## 59 4879.3421 4850.5114 4908.3440
## 60 4960.9549 4940.6730 4981.3201
## 61 5581.3981 5534.3548 5628.8414
## 62 5861.2124 5820.2560 5902.4569
## 63 5574.8777 5548.6184 5601.2613
## 64 5815.6063 5786.7861 5844.5700
## 65 5987.4306 5963.8100 6011.1446
## 66 6350.1873 6307.2567 6393.4101
## 67 6620.2329 6590.0085 6650.5959
## 68 6893.5614 6852.1030 6935.2707
## 69 7214.9504 7151.4212 7279.0440
## 70 7620.4264 7584.1069 7656.9199
## 71 7782.2056 7754.0042 7810.5094
## 72 8061.2257 8006.4048 8116.4220
## 73 8720.0235 8681.6719 8758.5445
## 74 8803.8471 8761.8373 8846.0582
## 75 9212.6083 9170.4593 9254.9509
## 76 9199.1815 9162.7615 9235.7462
## 77 9898.5171 9862.4445 9934.7216
## 78 9555.7379 9468.4566 9643.8238
## 79 10760.8789 10697.6546 10824.4769
## 80 11407.1952 11292.1149 11523.4482
## 81 11691.4576 11634.5050 11748.6889
## 82 12514.5215 12422.7444 12606.9767
## 83 12664.4871 12587.9120 12741.5281
## 84 12963.7257 12861.3620 13066.9042
## 85 13857.2611 13788.4567 13926.4089
## 86 15154.9754 15074.7667 15235.6108
## 87 15372.5241 15258.6298 15487.2686
## 88 15791.9001 15735.9156 15848.0838
## 89 17040.9553 16938.5110 17144.0193
## 90 17408.0264 17282.3735 17534.5929
## 91 17992.8830 17904.5528 18081.6489
## 92 18912.1907 18757.7983 19067.8539
## 93 19746.3114 19658.7003 19834.3129
## 94 20229.3760 20142.0126 20317.1184
## 95 21782.0187 21651.6274 21913.1952
## 96 21889.2743 21779.7893 21999.3096
## 97 21468.2660 21382.5634 21554.3121
## 98 22265.3032 22061.1384 22471.3574
## 99 24673.4640 24549.2250 24798.3317
## 100 25209.7074 25084.5878 25335.4510
## 101 25271.9867 25123.4190 25421.4330
## 102 26106.3282 25843.9985 26371.3207
## 103 28544.5258 28313.5482 28777.3876
## 104 29689.4939 29567.0004 29812.4948
## 105 29272.5660 28999.4995 29548.2037
## 106 31186.0038 30939.5415 31434.4293
## 107 35656.0034 35413.4774 35900.1903
## 108 35661.4994 35411.5718 35913.1910
## 109 34539.8290 34282.2070 34799.3868
## 110 36406.1526 36196.3948 36617.1260
## 111 37543.2985 37243.1994 37845.8157
## 112 40120.3228 39846.2368 40396.2941
## 113 41198.9581 40877.9624 41522.4745
## 114 46613.7677 46353.8635 46875.1292
## 115 40796.5908 40468.9743 41126.8596
## 116 45143.2410 44771.5259 45518.0422
## 117 46953.8524 46568.4419 47342.4527
## 118 47823.5956 47454.0505 48196.0185
## 119 51599.4931 51211.6249 51990.2989
## 120 51352.1683 50993.4456 51713.4146
n <- 120
freq_coverage_pois <- sum(`Cumulative cases` >= conf.df_pois[, 2] & `Cumulative cases` <= conf.df_pois[, 3])
freq_coverage_pois <- freq_coverage_pois/n
detach(data1)
Predictive accuracy of the Poisson model for New cases/day
Predicting with a \(95\%\) confidence interval
Quasi Poisson with Elapsed time as predictor
poisson1quasi <- glm(`Cumulative cases` ~ `Elapsed time`, data=data1[1:120, ], family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson1quasi)
pred.poisq <- poisson1quasi$fitted.values
res.stq <- (data1$`Cumulative cases` - pred.poisq)/sqrt(summary(poisson1quasi)$dispersion*pred.poisq)
#n=120, k= ?, n-k=?
print(paste("Estimated overdispersion", sum(res.stq^2)/23))
## [1] "Estimated overdispersion 15728.0601174826"
poisson1quasi.pred <- predict(poisson1quasi, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((poisson1quasi.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 1760.50922940986"
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1quasi$null.deviance, deviance(poisson1quasi)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 20716.39"
Quasi Poisson with Elapsed time and Age as predictors
#Let's apply a quasi poisson and see what happens
poisson2quasi <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1[1:120, ], family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson2quasi)
pred.poisq2 <- poisson2quasi$fitted.values
res.stq2 <- (data1$`Cumulative cases` - pred.poisq2)/sqrt(summary(poisson2quasi)$dispersion*pred.poisq2)
#n=120, k= ?, n-k=?
print(paste("Estimated overdispersion", sum(res.stq2^2)/18))
## [1] "Estimated overdispersion 21853.7140901657"
poisson2quasi.pred <- predict(poisson2quasi, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((poisson2quasi.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 3249.31716470375"
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson2quasi$null.deviance, deviance(poisson2quasi)), 2))
## [1] "Null deviance: 1740689.09" "Residual deviance: 19441.38"
Negative Binomial with Elapsed time as predictor
#n=120, k=2, n-k=118
stdres <- rstandard(nb1)
print(paste("Estimated overdispersion", sum(stdres^2)/118))
## [1] "Estimated overdispersion 177.301055416919"
nb1.pred <- predict(nb1, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb1.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 1765.66109000166"
## [1] "AIC: 21911.1770461798"
## [1] "Null deviance: 1738722.15" "Residual deviance: 20710.41"
Negative Binomial with Elapsed time plus Age as predictors
nb2 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1[1:120, ])
par(mfrow=c(2,2))
plot(nb2)
nb2.pred <- predict(nb2, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb2.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 3253.44941446475"
## [1] "AIC: 20645.2695885301"
## [1] "Null deviance: 1738975.56" "Residual deviance: 19434.52"
Negative Binomial with Elapsed time plus Department as predictors
nb3 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`, data=data1[1:120, ])
par(mfrow=c(2,2))
plot(nb3)
nb3.pred <- predict(nb3, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb3.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 3335.37126363787"
## [1] "AIC: 19270.2478115559"
## [1] "Null deviance: 1739087.22" "Residual deviance: 18047.5"
Negative Binomial with Elapsed time plus Continent of origin as predictors
# nb4 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, data=data1[1:22, ])
# par(mfrow=c(2,2))
# plot(nb4)
# nb4.pred <- predict(nb4, newdata = data1[23:25, ], type = "response")
# paste("RMSE:", sqrt(mean((nb4.pred - data1$`Cumulative cases`[23:25])^2)))
# #paste("MSE:", mean(nb4$residuals^2))
# paste("AIC:", nb4$aic)
# paste(c("Null deviance: ", "Residual deviance:"),
# round(c(nb4$null.deviance, deviance(nb4)), 2))
Negative Binomial with Elapsed time, Age and Departments as predictors
nb5 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`, data=data1[1:120, ])
par(mfrow=c(2,2))
plot(nb5)
# Calculating overdispersion: n=120, k=19, n-k=101
stdres <- rstandard(nb5)
print(paste("Estimated overdispersion", sum(stdres^2)/101))
## [1] "Estimated overdispersion 187.463666129657"
nb5.pred <- predict(nb5, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb5.pred - data1$`Cumulative cases`[120:126])^2)))
## [1] "RMSE: 8253.89918445986"
## [1] "AIC: 18080.4893928766"
## [1] "Null deviance: 1739076.35" "Residual deviance: 16847.74"
Negative Binomial with Elapsed time, Age and Departments as predictors for New cases/day
nb5bis <- glm.nb(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`, data=data1[1:120, ])
par(mfrow=c(2,2))
plot(nb5bis)
# Calculating overdispersion: n=120, k=19, n-k=101
stdres2 <- rstandard(nb5bis)
print(paste("Estimated overdispersion", sum(stdres2^2)/101))
## [1] "Estimated overdispersion 1.62538490690911"
nb5bis.pred <- predict(nb5bis, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb5bis.pred - data1$`New cases/day`[120:126])^2)))
## [1] "RMSE: 1379.56664915407"
## [1] "AIC: 1435.14403991019"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(nb5bis$null.deviance, deviance(nb5bis)), 2))
## [1] "Null deviance: 1416.29" "Residual deviance: 146.46"
Applying ANOVA to compare the negative binomial models
We decided to compare nb1, nb2 and nb5 because they are nested, and we are mainly interested in whether the fifth model is in fact better than the first.
## Likelihood ratio tests of Negative Binomial Models
##
## Response: Cumulative cases
## Model
## 1 Elapsed time
## 2 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima`
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 11252780 118 -21905.18
## 2 12919543 113 -20629.27 1 vs 2 5 1275.907 0
## 3 13728063 102 -18042.49 2 vs 3 11 2586.780 0
Predictive accuracy of the Negative Binomial model
Predicting with a \(95\%\) confidence interval
Predictive accuracy of the NB model for New cases/day
Predicting with a \(95\%\) confidence interval
The Bayesian approach
Poisson regression
As a first attempt, we fit a simple Poisson regression:
\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]
with \(i = 1,\dots,134\), where \(134\) is the number of rows of our dataset and \(y_i\) is the number of cases.
As for the Stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day, and the function poisson_log_lpmf to specify the likelihood.
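The poisson_log functions work with the rate on the log scale, i.e. \(y_i \sim Poisson(\exp(\eta_i))\). An R sketch of the same likelihood, with \(\alpha\) and \(\beta\) fixed at their prior means (a toy choice, not fitted values):

```r
alpha <- 0
beta  <- 0.25
elapsed_time <- 0:27
lambda <- exp(alpha + beta * elapsed_time)  # poisson_log parameterizes log(lambda)

set.seed(123)
y_sim <- rpois(length(elapsed_time), lambda)  # what poisson_log_rng draws

# The log-likelihood that poisson_log_lpmf accumulates:
ll <- sum(dpois(y_sim, lambda, log = TRUE))
```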
Posterior predictive check
The fit is not satisfactory; this is probably due to overdispersion. We can check the residuals to confirm this hypothesis.
Residual check
#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)
The variance of the residuals increases as the predicted values increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicate approximate \(95\%\) error bounds).
The plot of the standardized residuals indicates a large amount of overdispersion.
Classically, the problem of overdispersed count data is addressed by replacing the Poisson model with the negative binomial model.
Negative Binomial model
We try to improve the previous model using the Negative Binomial model:
\[ \ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \text{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]
where the parameter \(\phi\), called the precision, is such that:
\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]
Again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\), the negative binomial approaches the Poisson distribution.
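This mean–variance relation can be checked empirically with R's rnbinom, which uses the same \((\mu, \phi)\) parameterization (size plays the role of \(\phi\)); the numbers below are purely illustrative:

```r
# Negative binomial with mean mu and precision phi:
# Var = mu + mu^2 / phi; for large phi the draws look Poisson(mu).
set.seed(1)
mu  <- 10
phi <- 2
y   <- rnbinom(1e5, mu = mu, size = phi)
mean(y)  # close to mu = 10
var(y)   # close to mu + mu^2 / phi = 60
var(rnbinom(1e5, mu = mu, size = 1e6))  # close to mu, i.e. Poisson-like
```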
The Stan functions that we use here are neg_binomial_2_log_rng, to specify the distribution of \(y_i\), and neg_binomial_2_log_lpmf for the likelihood.
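Relative to the Poisson program, essentially only the precision parameter, the likelihood and the generated quantities change; a sketch (the prior on inv_phi is our assumption, but the parameter name matches the samples_NB$inv_phi used in the residual check below):

```stan
// Sketch of the Negative Binomial variant; data block as before.
parameters {
  real alpha;
  real beta;
  real<lower=0> inv_phi;     // phi = 1 / inv_phi
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0.25, 1);
  inv_phi ~ normal(0, 1);    // assumed prior, truncated at 0
  target += neg_binomial_2_log_lpmf(cases |
              alpha + beta * elapsed_time, 1 / inv_phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * elapsed_time[n],
                                      1 / inv_phi);
}
```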
Posterior predictive check
samples_NB <- rstan::extract(fit2)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data$cases, y_rep[1:200,])
Residual check
mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)
The situation is better now, but we still have too many residuals outside the \(95\%\) interval.
Accuracy across departments
ppc_stat_grouped(
y = model.data$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)
We should take into account the differences across departments.
Multilevel Negative Binomial regression
We try to fit the following model, which also includes Age as a covariate:
\[ \ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i \\ y_i \sim \text{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]
Posterior predictive check
samples_NB2 <- rstan::extract(fit3)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2$cases, y_rep[1:200,])
Residual check
mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)
Accuracy across departments
ppc_stat_grouped(
y = model.data2$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)
Hierarchical model
In order to improve the fit, we fit a model with a department-specific intercept term.
So the varying intercept model that we take into account is now:
\[ \ln(\lambda_{i,d}) = \alpha_d + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i \\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d,\ \sigma_{\alpha}) \\ y_i \sim \text{Negative Binomial}(\lambda_{i,d}, \phi) \]
The priors used for the above model are the following:
\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]
being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).
New dataset
We added the following covariates to the dataset:
People: millions of inhabitants of each region;
Surface: \(km^2\), extent of each region;
Density: \(\frac{people}{km^2}\), density of the population in each region.
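Density is determined by the other two covariates. A minimal sketch of how such a table can be built (the numeric values below are placeholders, not the real figures):

```r
# Hypothetical example rows -- the numbers are placeholders, not the real data.
regions <- data.frame(
  Department = c("Antioquia", "Casanare"),
  People     = c(6.5, 0.4),       # millions of inhabitants (placeholders)
  Surface    = c(63000, 44600)    # km^2 (placeholders)
)
# Density in people per km^2, derived from the two columns above
regions$Density <- regions$People * 1e6 / regions$Surface
```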
The model is:
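The full Stan program is not reproduced here; a sketch consistent with the specification above (our reconstruction, with assumed variable names) is:

```stan
// Sketch of the varying-intercept Negative Binomial model.
data {
  int<lower=1> N;                   // observations
  int<lower=1> D;                   // departments
  int<lower=1, upper=D> dep[N];     // department index of each observation
  vector[N] elapsed_time;
  vector[N] age;
  vector[D] pop;                    // People
  vector[D] surface;                // Surface
  vector[D] density;                // Density
  int<lower=0> cases[N];
}
parameters {
  real mu;
  vector[D] alpha;                  // department-specific intercepts
  real beta_time;
  real beta_age;
  real beta_pop;
  real beta_sur;
  real beta_dens;
  real<lower=0> sigma_alpha;
  real<lower=0> inv_phi;
}
model {
  // priors for mu, sigma_alpha and inv_phi omitted / assumed
  beta_time ~ normal(0.5, 1);
  beta_age  ~ normal(0, 1);
  beta_pop  ~ normal(0, 1);
  beta_sur  ~ normal(0, 1);
  beta_dens ~ normal(0, 1);
  alpha ~ normal(mu + beta_pop * pop + beta_sur * surface
                    + beta_dens * density, sigma_alpha);
  target += neg_binomial_2_log_lpmf(cases |
              alpha[dep] + beta_time * elapsed_time + beta_age * age,
              1 / inv_phi);
}
```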
Posterior predictive check
samples_hier <- rstan::extract(fit.4)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete$cases, y_rep[1:200,])
Residual check
mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)
Very few points are now outside the \(95\%\) confidence interval.
Accuracy across departments
ppc_stat_grouped(
y = data.hier.NB.complete$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)
We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.
LOOIC
Leave-One-Out cross-validation (LOO) is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model, using the log-likelihood evaluated at the posterior simulations of the parameter values.
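In practice this can be computed with the loo package, assuming the Stan programs store the pointwise log-likelihood in a generated quantity named log_lik (an assumption on our part):

```r
library(loo)
# Pointwise log-likelihood matrix from the fitted stanfit object
log_lik_poisson <- extract_log_lik(fit, parameter_name = "log_lik")
loo.model.Poisson <- loo(log_lik_poisson)
loo.model.Poisson  # reports elpd_loo, p_loo and looic with standard errors
```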
We plot the LOOIC of each model to compare them:
# LOOIC of each model (third element of the loo objects)
loo.all.deps <- unlist(c(loo.model.Poisson[3], loo.model.NB[3],
                         loo.model.NB2[3], loo.model.NB.hier[3]))
ord <- sort.int(loo.all.deps, index.return = TRUE)
par(xaxt = "n")
plot(ord$x, type = "b", xlab = "", ylab = "LOOIC", main = "Model comparison")
par(xaxt = "s")
axis(1, 1:4, c("Poisson", "NB-sl", "NB-ml", "hier")[ord$ix], las = 2)